Problems at My Previous Company (2)

2022-03-11

#Work
#Issues

Recently I have been paying attention to service integration in software development, and to some startups that have succeeded in that direction. Seeing some people interview at the company also reminded me of my time at my previous company. In the previous article, I did not specifically mention that the department’s market positioning was unclear. There are many historical reasons for this: the company started out with solid backing as a subsidiary and was used to acting confidently as the client, then later shifted to a vendor role without any market-oriented DNA. Here I will discuss only the results.

If the department’s positioning was to build products on its own technical foundation, then the problems were serious.

After I officially took over development of the core product, I found the project in engineering disarray. The problem was not that individual pieces of code were messy; the project lacked top-level design, clear module boundaries, a coherent directory structure, and sound software design. It supported four or five types of databases, yet their configuration was not unified: settings were scattered across different locations in different configuration files. Which settings actually took effect, and which features worked and which did not, could only be determined by experience or guesswork. The API design was also poor. In my article “Thoughts”, I mentioned the issues with URL parameters. In addition, URL configuration and parameter validation were written in a single configuration file, so adding a URL for a smart contract meant modifying that file and restarting the node. As for hot-loading the configuration rules, no one seemed to care.
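As a rough illustration of the alternative, here is a minimal sketch of a route registry that a contract could register with at runtime, so that adding a URL would not require editing a shared file and restarting the node. The original project's language and API are not described here, so Python is used purely for illustration, and every name (`RouteRegistry`, `validate_transfer`, the `/contract/transfer` path) is hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical validator signature: takes request parameters and returns
# an error message, or None when the parameters are acceptable.
Validator = Callable[[dict], Optional[str]]


class RouteRegistry:
    """Maps URL paths to validators; routes can be added while the node runs."""

    def __init__(self) -> None:
        self._routes: Dict[str, Validator] = {}

    def register(self, path: str, validator: Validator) -> None:
        # Reject duplicates so two contracts cannot silently claim one URL.
        if path in self._routes:
            raise ValueError(f"route already registered: {path}")
        self._routes[path] = validator

    def validate(self, path: str, params: dict) -> Optional[str]:
        validator = self._routes.get(path)
        if validator is None:
            return f"unknown route: {path}"
        return validator(params)


def validate_transfer(params: dict) -> Optional[str]:
    # Example rule for a made-up endpoint, not taken from the real project.
    amount = params.get("amount")
    if not isinstance(amount, int) or amount <= 0:
        return "amount must be a positive integer"
    return None


registry = RouteRegistry()
registry.register("/contract/transfer", validate_transfer)
```

With this shape, each contract would carry its own URL and validation rule, and the node would only need to call `registry.validate(path, params)` before dispatching a request.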

The smart contract mechanism had problems as well. After a transaction was submitted to a contract, checking and execution were split into two steps. The check function’s input and output were both bitMaps, and their lengths had to match exactly; otherwise the node would panic with an index-out-of-range error in the loop that processed them. But these were supposed to be smart contracts, so why write them so rigidly? Later, someone told me that the principle for writing smart contracts was “absolutely no errors”: despite the name, they were deeply coupled with the project’s functionality, were called system contracts, and were developed by the underlying chain’s developers rather than by users. Writing one required a thorough understanding of the underlying chain. People seemed proud of this, thinking “we can write these because we know the chain,” and saw it as a high-barrier feature rather than as incomplete functionality. Originally there had been no plan to support smart contracts at all; they were bolted on as requirements grew. At the time, someone even revered an article by a leader that said “Blockchain doesn’t necessarily need smart contracts,” and greatly admired this design, which ran contrary to the general idea of smart contracts.
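To make the length problem concrete, here is a minimal sketch of a wrapper that validates both bitmaps and raises a recoverable error instead of letting a later loop index out of range. It assumes, for illustration only, that each bit marks whether the transaction at the same position is still considered valid; the real semantics of the bitmaps are not described here. Python and all names (`run_check`, `CheckError`) are hypothetical.

```python
from typing import Callable, List, Sequence

Bitmap = List[bool]
# Hypothetical shape of a contract's check function.
CheckFn = Callable[[Sequence[object], Bitmap], Bitmap]


class CheckError(Exception):
    """Raised when a check function breaks the bitmap length rule."""


def run_check(check: CheckFn, txs: Sequence[object], in_mask: Bitmap) -> Bitmap:
    # Validate the input before the contract sees it.
    if len(in_mask) != len(txs):
        raise CheckError(
            f"input bitmap has {len(in_mask)} bits for {len(txs)} transactions"
        )
    out_mask = check(txs, in_mask)
    # Validate the output before any caller loops over it by index.
    if len(out_mask) != len(in_mask):
        raise CheckError(
            f"check returned {len(out_mask)} bits, expected {len(in_mask)}"
        )
    return out_mask
```

A caller that catches `CheckError` can reject the batch or disable the faulty contract, which is a milder failure than taking the whole node down.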

After writing several contracts, one would notice that every contract’s check function contained a statement testing whether the transaction was a null pointer. How could a null transaction appear in a call between internal system functions? After reproducing and troubleshooting the problem, it turned out that under concurrency the queue would occasionally misbehave: a batch of transactions was pushed in, and popping yielded null transactions. It was purely a data structure issue. Although the cause was identified, it was never fixed. Instead, I lazily added the null check to each new contract. Who knows how many hidden bugs previous developers left behind?
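Fixing the queue itself would have removed the need for a null check in every contract. The following is a minimal sketch of that idea, not the project's actual queue: a lock guards both batch operations, null entries are refused on the way in, and popping returns a possibly empty list so that "no transaction" is never represented by a null element. Python and the names `TxQueue`, `push_batch`, and `pop_batch` are assumptions made for illustration.

```python
import threading
from collections import deque
from typing import Deque, List


class TxQueue:
    """A transaction queue whose batch operations are atomic under a lock."""

    def __init__(self) -> None:
        self._items: Deque[object] = deque()
        self._lock = threading.Lock()

    def push_batch(self, txs: List[object]) -> None:
        # Refuse null entries at the boundary so consumers never see one.
        if any(tx is None for tx in txs):
            raise ValueError("refusing to enqueue a null transaction")
        with self._lock:
            self._items.extend(txs)

    def pop_batch(self, max_items: int) -> List[object]:
        # Return at most max_items transactions; an empty list means
        # the queue was empty, so no placeholder value is needed.
        with self._lock:
            count = min(max_items, len(self._items))
            return [self._items.popleft() for _ in range(count)]
```

Because the invariant is enforced in one place, contracts can treat every transaction they receive as non-null.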

When contracts handled business logic, database read-write performance (with MySQL, for example) became the bottleneck for transaction throughput. There was also a question I never understood: since contracts are developed for specific business purposes, how does a contract’s check function determine whether a transaction will succeed or fail? The contract checks and executes transactions at the commit stage of BFT consensus. Consensus has already been reached at that point, and the check function cannot pre-execute database writes (if contracts relied on database features such as transactions, the blockchain would be meaningless). Must every possible failure scenario be ruled out in advance? Would that exclusion happen at the semantic level or the execution level? Even if the exceptions could be enumerated, how much performance would that cost? One might ask, why not check before consensus? There are checks before consensus, but whether the check happens before or after consensus, the total number of database operations stays the same, and so does the performance cost. (Extended thought: why do public chains not have this issue, while consortium chains do?)

Confused version management was another engineering issue. No one could state the current version number with certainty. Was it 2.0? The configuration file still said 1.4. Was it 2.0.1? There was also a 2.0.3 in the repository, but no one knew who had changed it or what the changes were. Test code lived in the main codebase: for example, testing exception handling in BFT consensus through malicious voting was only possible by modifying the code, so that code stayed in the project and could be activated via configuration. There were other redundancies too, such as common smart contract interfaces that had been added for UTXO compatibility, which contracts that did not need them still had to implement.
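The UTXO case is a common interface-design problem, and one way around it is to make the extra capability optional. The sketch below is only an illustration under assumed names (`Contract`, `UTXOAware`, `spent_outputs`), written in Python rather than the project's own language: the base interface stays small, and the chain asks a contract for UTXO behaviour only if the contract provides it.

```python
from typing import List, Protocol, runtime_checkable


class Contract(Protocol):
    """The small interface every contract implements (hypothetical)."""

    def check(self, tx: dict) -> bool: ...
    def execute(self, tx: dict) -> None: ...


@runtime_checkable
class UTXOAware(Protocol):
    """Optional capability, implemented only by contracts that need UTXO."""

    def spent_outputs(self, tx: dict) -> List[str]: ...


class VoteContract:
    # No UTXO methods here, and none are required.
    def check(self, tx: dict) -> bool:
        return "candidate" in tx

    def execute(self, tx: dict) -> None:
        pass


class TokenContract:
    def check(self, tx: dict) -> bool:
        return bool(tx.get("inputs"))

    def execute(self, tx: dict) -> None:
        pass

    def spent_outputs(self, tx: dict) -> List[str]:
        return list(tx["inputs"])


def spent_outputs_for(contract: Contract, tx: dict) -> List[str]:
    # Ask for the capability instead of forcing it on every contract.
    if isinstance(contract, UTXOAware):
        return contract.spent_outputs(tx)
    return []
```

Here `VoteContract` carries no dead UTXO methods, while `TokenContract` opts in simply by defining `spent_outputs`.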

The project had various other technical issues, including ones I had solved earlier, such as blacklists failing in the proposal phase because of the use of VRF. These showed that the system was incomplete even in its core areas, and that refactoring would be costly. The project’s greatest value was its participation in a well-known industry test. Passing the test would earn the company a certificate from an organization, proving that the software was qualified, certified, and up to standard. The company could then use the certificate for promotion, bidding, and sales. The quality of the software itself was irrelevant. At first, the test items looked like basic blockchain requirements, and passing seemed like nothing special. Once I was personally involved, though, I realized that the testing process was full of difficulties, all of them self-inflicted: unprofessional developers, lax team management, many unreasonable design decisions, and the loss of documentation and personnel all made test preparation much harder. Perhaps no one understood what a worthwhile difficulty was, and that created the illusion of a good project.

If the department’s positioning was service integration, providing solutions and technical support, it fell short there too.

Besides the blockchain layer, the department had projects such as BaaS, middleware, an SDK, and browsers, involving components like Kafka, ZooKeeper, Redis, and Prometheus, but these were used superficially, with little technical depth, and the overall result was poor. There was no product, no UI, no user-oriented thinking, and no sense of ownership. Every improvement was made ad hoc in response to business needs, which led to rushed work that always put customer demands first. Service integration can be simple or complex, and it can be done well or badly. Yet the department also seemed reluctant to do service integration. For instance, when building something on Hyperledger Fabric, the leader would say, “Clients will ask, ‘You are only using Fabric and haven’t done much yourselves, so why charge for it?’” This reflects the department’s chaotic positioning.